CallbackGroup centric executor #2405
base: rolling
Conversation
I worked further on this, and the executor is now usable and works stably in our system. I did a few performance measurements and came out ahead of the SingleThreaded and MultiThreaded executors. Removed benchmark results, as they were taken with the CPU frequency scheduler not in performance mode. Please refer to the ones above.
The likely explanation is that publishers/timers are running faster (thanks to the improved performance of the new executor), so you are publishing more messages. I would be interested in seeing a performance comparison with the EventsExecutor. Moreover, with a new executor proposal we should have a detailed performance analysis and design document accompanying it. NOTE: I didn't look at the code yet.
That would be nice, but I think something is broken with the shutdown.
Define 'faster' :-). The latency of the EventsExecutor is way better.
The main point of this executor is that it has similar behavior in single and multithreaded mode. Note, this is not the case for the current executors. That it is fast and shows a reduction in CPU load is a nice add-on.
I changed the design 4 times during the refinement of the approach... This started out as a request for feedback on an idea, and meanwhile evolved into a full-blown, reasonably stable executor. I would first like to discuss it, and see if I can get the changes needed for it into rclcpp and rcl, before putting any work into design documents...
That would be great, thanks
This makes sense. Usually the path forward for a new executor is the following:
This PR can stay open (maybe as a draft) to get initial feedback from the maintainers, but I would prefer to start with a design document to fully understand the theory of operation before diving into the code. If the purpose of this executor is to "fix" a bug in the current single and multithreaded executors, I would rather ask to replace them and/or apply the changes to those executors themselves rather than creating a new one. On the other hand, if "having a similar behavior in single and multithreaded mode" isn't really a bug-fix, but rather an alternative approach (similar to how the events executor processes events in a different order than the single threaded executor), we could consider adding this new executor after the community has shown enough interest in it (e.g. contributions, discourse discussions, etc.)
I used "faster" on purpose as a very generic term. We showed how the events executor had better performance than the standard executor across a large number of tests and metrics (CPU usage, latency, throughput, accuracy of timers, and even a slight improvement in RAM usage). We currently have 4 executors in the repo. I really don't want to go to 5 unless there's a very strong reason.
I'll try to make some time to come up with a document.
Replacing the Static, Single and MultiThreaded Executor with this one would be feasible, but also a bit drastic. My idea was to add it as a possible replacement for transition and testing. I had a short look at the EventsExecutor and would say, if we merge it with this approach, we will get something really nice. The EventsExecutor has the solution for how to avoid the waitsets, which take ages to build. On the other hand, the cbg executor has the logic to schedule ready events in callback groups in prioritized order across multiple threads. The result should be fast, but behave similarly to the current executor.
@alsora I reworked the description to contain a proposal and a detailed description of the issue that it should fix. I would be fine with having this executor outside of rclcpp, but as mentioned above, I will need to adapt some APIs and structures in rcl and rclcpp slightly for this. As for my idea on how to combine this and the EventsExecutor approach, it's basically 'take out the wait for work stuff from the spin function and fill the event queues from callbacks from the RMW'. The housekeeping might be challenging though.
Many ideas here, thanks for doing the leg-work and sorry for taking so long to begin thinking about this. One note, and something that @wjwwood keeps reminding me of about our current implementation:
This isn't so much a queue as a set. Basically, everything has a boolean state of "am I ready or not" rather than "how many things do I have ready". In the case that you had a subscription that received 1 message during the wait and one that received 100 messages, both will be processed equally (one callback per subscription).
This is a first version of a new executor that saves ready executables per callback group.
In doing so, this executor should have the same behavior as the SingleThreadedExecutor in both single- and multithreaded operation.
Current behavior of the SingleThreadedExecutor
Currently the SingleThreadedExecutor will collect all ready events by calling wait_for_work (rclcpp/rclcpp/src/rclcpp/executor.cpp, line 748 in c500695).
This will clear the global internal event queue (within AllocatorMemoryStrategy) and repopulate it with ready events.
Afterwards, the SingleThreadedExecutor will process all elements from the global internal event queue by calling get_next_ready_executable (rclcpp/rclcpp/src/rclcpp/executor.cpp, line 894 in c500695) followed by executing the returned executable (line 538 in c500695), in a loop.
Note that get_next_ready_executable returns the ready events in a prioritized way, e.g. timers first.
If get_next_ready_executable returns false, the global event queue will be empty and wait_for_work will be called again.
In case multiple events are ready at the time wait_for_work was called, this will lead to a prioritized first-in-first-out pattern: events that become ready while the processing happens are not added, as wait_for_work will not be called again until all collected events have been processed.
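To make the call pattern explicit, here is a condensed sketch of the loop described above (not the actual rclcpp code): wait_for_work and get_next_ready_executable stand in for the rclcpp::Executor members of the same name, while execute_executable is just my placeholder for the execution step at line 538.

```cpp
// Condensed sketch of the SingleThreadedExecutor pattern described above.
#include <chrono>
#include "rclcpp/rclcpp.hpp"
#include "rclcpp/any_executable.hpp"

void wait_for_work(std::chrono::nanoseconds timeout);         // clears and repopulates the global queue
bool get_next_ready_executable(rclcpp::AnyExecutable & out);  // pops in prioritized order (timers first, ...)
void execute_executable(rclcpp::AnyExecutable & any_exec);    // runs the user callback

void single_threaded_spin_sketch()
{
  while (rclcpp::ok()) {
    // Collect everything that is ready *right now* into the one global queue.
    wait_for_work(std::chrono::nanoseconds(-1));
    // Drain that queue completely; events that become ready during execution
    // are only picked up by the next wait_for_work() call.
    rclcpp::AnyExecutable any_exec;
    while (get_next_ready_executable(any_exec)) {
      execute_executable(any_exec);
    }
  }
}
```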
Current behavior of the MultiThreadedExecutor
The MultiThreadedExecutor utilizes the same code as the SingleThreadedExecutor and also has a global internal event queue. It will also call wait_for_work (rclcpp/rclcpp/src/rclcpp/executor.cpp, line 748 in c500695) to clear and populate the internal queue.
Afterwards, there will be multiple threads calling get_next_ready_executable (rclcpp/rclcpp/src/rclcpp/executor.cpp, line 894 in c500695) and executing the returned executable (line 538 in c500695).
If get_next_ready_executable returns false, wait_for_work will be called again. This will also block all other threads from calling get_next_ready_executable until wait_for_work returns. At this point there is a subtle difference: the global event queue is not guaranteed to be empty.
The difference comes from the fact that get_next_ready_executable will not return any event that belongs to a mutually exclusive callback group that is marked as currently being executed (by another thread).
Therefore we often see a pattern like this:
1. wait_for_work is called, and multiple events of one callback group are ready and enqueued. One thread takes one event from the queue and starts processing it.
2. Meanwhile, a second thread comes by and calls get_next_ready_executable. As all ready events belong to a callback group that is currently executing, get_next_ready_executable returns false, and wait_for_work is called. This clears the global event queue.
3. To avoid a busy loop, wait_for_work masks the callback group currently in execution out of the wait set and blocks the thread.
4. After thread one has finished executing the event, it marks the callback group as ready and triggers a guard condition, which wakes up the thread blocked in wait_for_work.
5. Thread 2 is woken up by the guard condition and collects all ready events. Usually there will be no other event ready at this point, and get_next_ready_executable returns false, as wait_for_work had cleared the event queue.
6. Afterwards, thread 2 calls wait_for_work again, this time with the previously masked callback group added to the waitset, as it is now ready. As events are still pending in that callback group, rcl_wait returns instantly. Note, if new events became ready in the meantime, they will also be added to the event queue. Thread 2 then takes the highest-priority event and executes it.
The result is a different pattern of execution of the events, leaning hard towards priority-first, especially for events that take a long time to process.
A really undesirable behavior that we observed a lot happens if you have a timer that takes longer than its period.
In this case, only the timer will be executed, in an endless loop, while no subscriptions, services etc. will be processed by the executor any more.
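For illustration, a minimal example that can trigger this behavior with the current MultiThreadedExecutor might look like the following. Topic name, timings and the use of the default mutually exclusive callback group are assumptions made only for the sake of the example.

```cpp
#include <chrono>
#include <memory>
#include <thread>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

// Node with a timer whose callback takes longer than the timer period, plus a
// subscription in the same (default, mutually exclusive) callback group.
class StarvationDemo : public rclcpp::Node
{
public:
  StarvationDemo()
  : Node("starvation_demo")
  {
    // 100 ms period, but the callback needs ~150 ms.
    timer_ = create_wall_timer(
      std::chrono::milliseconds(100),
      [this]() {
        std::this_thread::sleep_for(std::chrono::milliseconds(150));
        RCLCPP_INFO(get_logger(), "timer fired");
      });
    // With the behavior described above, this callback gets starved.
    sub_ = create_subscription<std_msgs::msg::String>(
      "chatter", 10,
      [this](std_msgs::msg::String::ConstSharedPtr msg) {
        RCLCPP_INFO(get_logger(), "received: %s", msg->data.c_str());
      });
  }

private:
  rclcpp::TimerBase::SharedPtr timer_;
  rclcpp::Subscription<std_msgs::msg::String>::SharedPtr sub_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::executors::MultiThreadedExecutor executor;
  auto node = std::make_shared<StarvationDemo>();
  executor.add_node(node);
  executor.spin();
  rclcpp::shutdown();
  return 0;
}
```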
A second undesired behavior comes up in combination with guard conditions. Guard conditions will wake up the wait_for_work function only once after each trigger. This leads to problems with Waitables that internally use guard conditions, as they need to be retriggered any time the Waitable is added to a waitset and there is still data available for processing. Bugs related to this have been observed with the interprocess communication, and a fix is currently being worked on (#2420).
A third point is that wait_for_work is expensive to call and that the current logic calls it multiple times without real need, thus increasing latency and CPU load.
Proposed solution
Instead of using one global event queue, have one event queue per callback group.
This will lead to the identical execution pattern per callback group as the SingleThreadedExecutor, thus avoiding the starvation issues we are seeing with the multithreaded executor.
As Waitables using guard conditions internally will also be guaranteed to be executed after a guard condition trigger, this would also solve the issue of lost events and reduce the pitfalls of implementing Waitables.
Last, the CPU load and latency can be reduced, as we can save a lot of unneeded calls to wait_for_work.
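A rough sketch of the bookkeeping this implies is shown below. The names and container choices are illustrative assumptions, not the actual classes of this PR; the point is only that ready events are stored per callback group instead of in one global queue.

```cpp
#include <deque>
#include <map>
#include <memory>
#include <mutex>
#include "rclcpp/any_executable.hpp"
#include "rclcpp/callback_group.hpp"

// Per-callback-group bookkeeping (illustrative names, not the PR's code).
struct CallbackGroupState
{
  // Events collected by wait_for_work that still have to be executed,
  // in the prioritized order in which they were collected.
  std::deque<rclcpp::AnyExecutable> ready_events;
  // Set while a worker thread executes an event of this (mutually exclusive)
  // group; other threads skip the group while this is true.
  bool executing = false;
};

// One state object per callback group instead of a single global queue.
// wait_for_work only re-adds groups whose ready_events queue is empty, so the
// per-group FIFO order established in one wait cycle is never reshuffled.
std::map<
  rclcpp::CallbackGroup::WeakPtr,
  CallbackGroupState,
  std::owner_less<rclcpp::CallbackGroup::WeakPtr>> group_states;
std::mutex group_states_mutex;  // guards group_states across worker threads
```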
Proposed Implementation
In order to be able to queue ready events by their callback group, we need a reverse mapping from the rcl events in the waitset to the rclcpp executable. This implementation achieves this by two extensions to rcl:
- While adding the rcl events to the waitset, the user_data is set to a pointer to a std::variant.
- The std::variant contains the originating queue of the event and the type of the event.
This is needed, as the source of every event is either a Subscription, Timer, Client, Service, or rclcpp::Waitable.
For timers, subscriptions, clients and services, adding the mapping is straightforward.
As the executor has no way of knowing a priori what a Waitable will add, it saves the count of elements in the waitset before calling add_to_wait_set and computes the added elements from the count difference after the call. Afterwards, the user_data pointer of every newly added element is set to a variant pointing to this Waitable.
After the call to rcl_wait, all ready events are collected and added to their originating event queue. Note, Waitables will only be added once, even if multiple rcl events trigger for them.
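One possible shape of this mapping could look like the following. This is purely illustrative; the actual PR may lay it out differently (for instance, the queue pointer could live inside the variant itself), and it assumes the proposed user_data extension to rcl described above.

```cpp
#include <variant>
#include "rclcpp/client.hpp"
#include "rclcpp/service.hpp"
#include "rclcpp/subscription_base.hpp"
#include "rclcpp/timer.hpp"
#include "rclcpp/waitable.hpp"

struct CallbackGroupQueue;  // the per-callback-group event queue (see the sketch above)

// One such entry exists per element added to the rcl wait set; a pointer to it
// is stored in the (proposed) user_data field, so that after rcl_wait the
// ready entity can be routed back to the queue of its callback group.
struct WaitSetEntry
{
  CallbackGroupQueue * origin_queue;  // where the ready event gets enqueued
  std::variant<
    rclcpp::SubscriptionBase::WeakPtr,
    rclcpp::TimerBase::WeakPtr,
    rclcpp::ClientBase::WeakPtr,
    rclcpp::ServiceBase::WeakPtr,
    rclcpp::Waitable::WeakPtr> source;  // which rclcpp entity became ready
};
```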
wait_for_work
- Housekeeping
- Adds all callback groups with no unprocessed events to the waitset
- Calls rcl_wait
- Collects all ready events
- Calls notify_one for each newly ready callback group, to wake up worker threads
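A sketch of this sequence, with placeholder helper names that only hint at what the real housekeeping and collection code does:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Placeholders for the PR's actual housekeeping / collection code.
void do_housekeeping();                                // prune dropped groups, pick up newly added ones
void add_idle_callback_groups_to_wait_set();           // only groups whose event queue is empty
void call_rcl_wait(std::chrono::nanoseconds timeout);  // wraps rcl_wait on the rebuilt wait set
int  collect_ready_events_per_callback_group();        // routes ready events to their group queues,
                                                       // returns the number of groups with new events

std::mutex cv_mutex;
std::condition_variable work_available;

void wait_for_work_sketch(std::chrono::nanoseconds timeout)
{
  do_housekeeping();
  // Groups that still hold unprocessed events are NOT re-added, so the
  // per-group FIFO order from the previous wait cycle is preserved.
  add_idle_callback_groups_to_wait_set();
  call_rcl_wait(timeout);
  const int newly_ready_groups = collect_ready_events_per_callback_group();
  // Wake one worker thread per callback group that just received events.
  for (int i = 0; i < newly_ready_groups; ++i) {
    work_available.notify_one();
  }
}
```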
get_next_ready_executable
- This function is now threadsafe
- Returns events in priority order from ready callback groups, e.g. timers first if any are ready in any ready callback group
spin loop
If a thread has no work, it will:
- call get_next_ready_executable
  - execute the returned executable
  - if the executed event was the last one in its callback group
    - wake up wait_for_work
- try to acquire the wait_for_work_mutex
- block on a std::condition_variable
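Putting wait_for_work, get_next_ready_executable and the worker loop together, the thread logic could be sketched roughly as below. The helper names are assumptions; in particular, waking up a thread that is blocked inside rcl_wait would in practice go through a guard condition rather than only the condition variable.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <optional>
#include "rclcpp/rclcpp.hpp"
#include "rclcpp/any_executable.hpp"

// Thread-safe: returns the highest-priority event (timers first) from any
// ready callback group that is not currently executing, or nothing.
std::optional<rclcpp::AnyExecutable> get_next_ready_executable();
void execute_executable(rclcpp::AnyExecutable & any_exec);               // runs the user callback
bool was_last_event_of_callback_group(const rclcpp::AnyExecutable & e);  // group queue now empty?
void wait_for_work(std::chrono::nanoseconds timeout);                    // see the sketch above

std::mutex wait_for_work_mutex;  // only one thread runs wait_for_work at a time
std::mutex cv_mutex;
std::condition_variable work_available;

void worker_thread_sketch()
{
  while (rclcpp::ok()) {
    if (auto any_exec = get_next_ready_executable()) {
      execute_executable(*any_exec);
      if (was_last_event_of_callback_group(*any_exec)) {
        // The group has no queued events left, so it may be re-added to the
        // wait set: wake up the thread sitting in wait_for_work.
        work_available.notify_all();
      }
      continue;
    }
    // No ready work: one thread performs wait_for_work, all others block on
    // the condition variable until new events have been collected.
    std::unique_lock<std::mutex> wait_lock(wait_for_work_mutex, std::try_to_lock);
    if (wait_lock.owns_lock()) {
      wait_for_work(std::chrono::nanoseconds(-1));
    } else {
      std::unique_lock<std::mutex> cv_lock(cv_mutex);
      work_available.wait(cv_lock);
    }
  }
}
```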
Performance results
The performance of the executor with respect to the existing ones depends on the number of added elements. It was measured:
- on low numbers (10 nodes, each with 1 sub and 1 pub)
- on high numbers (10 nodes, each with 10 subs and 10 pubs)
- on absurdly high numbers (10 nodes, each with 30 subs and 30 pubs)
(benchmark figures omitted)
The iRobot benchmark results (topology/white_mountain.json):
(results omitted)
Note, there is something fishy going on, as the CbgExecutor receives more messages than the normal one; I am still looking into this.
@mjcarroll @wjwwood @clalancette @fujitatomoya Can you have a short look at this?